(from de Jonge & van der Loo)
A data set is a collection of data that describes attribute values (variables) of a number of real-world objects (units). By technically correct data we mean a data set where each value:
This means that, for each unit, a text variable should be stored as text, a numeric variable as a number, and so on, all in a format that is consistent across the data set, with appropriate variable (column) names. I will not cover the case in which the variable is categorical, since that would require the use of labels. Further reading on this topic is available at: http://pandas.pydata.org/pandas-docs/stable/categorical.html
The objective is to:
Note that:
We have already seen in previous notes that Pandas provides functions that take a file as input and return a DataFrame as output. This is the most suitable way of reading text into Pandas; however, we can always fall back on Python's csv library and convert the data to a DataFrame using any of the DataFrame constructors.
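For illustration, here is a minimal sketch of that csv-module approach (assuming, as in the cells below, that people.csv has a header row):
In [ ]:
import csv
import pandas as pd

# read the raw rows with Python's csv module
with open("../data/people.csv") as f:
    rows = list(csv.reader(f))
# use the first row as column names and the rest as data
pd.DataFrame(rows[1:], columns=rows[0])
Note that, unlike read_csv, this keeps every value as str, so any type inference must be done by hand.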
We will cover some of the most common uses of the Pandas CSV-reading functions. For other file formats and further options, please refer to http://pandas.pydata.org/pandas-docs/stable/io.html.
Nowadays, the most common data format is CSV: tabular text files that use a separator character to divide the values of each row. There are two main CSV conventions: comma-separated and semicolon-separated files.
In [ ]:
fname = "../data/people.csv"
with open(fname) as f:
    content = f.readlines()
print(content[:5])
Pandas provides the read_csv function, which takes a CSV file as input and returns a DataFrame.
In [1]:
import pandas as pd
from IPython.display import display, HTML
path_to_file = "../data/people.csv"
df = pd.read_csv(path_to_file)
display(df)
print("Types:")
display(df.dtypes)
print(df.describe())
In [2]:
df.columns
Out[2]:
Some things to note:
Useful options of read_csv (a combined example follows this list):
sep : str, defaults to ',' for read_csv(), \t for read_table()
Delimiter to use. If sep is None, will try to automatically determine this. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions, will force use of the python parsing engine and will ignore quotes in the data. Regex example: '\r\t'.
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is as if header=0 if no names passed, otherwise as if header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of ints that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None.
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names).
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (unsupported with engine='python'). Use str or object to preserve and not interpret dtype.
skiprows : list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
skipfooter : int, default 0
Number of lines at bottom of file to skip (unsupported with engine='c').
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files.
na_values : str, list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''.
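As a combined illustration of these options, here is a sketch; the file name, the semicolon separator, the skipped first line, and the "missing" NA marker are all assumptions about a hypothetical file:
In [ ]:
# hypothetical semicolon-separated file: skip a junk first line,
# treat "missing" as NaN, and read only the first 100 rows
df_sketch = pd.read_csv("../data/people_semicolon.csv",
                        sep=";",
                        skiprows=[0],
                        na_values=["missing"],
                        nrows=100)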
Another common file format is XLS, which contains an Excel spreadsheet. In this case, each spreadsheet cell becomes a DataFrame cell, each spreadsheet column a DataFrame column, and the same holds for rows.
In [3]:
import pandas as pd
country_df = pd.read_excel("../data/country_info_worldbank.xls", skiprows=[0, 1, 2], header=1)
country_df.head(8)
Out[3]:
Some things to note:
Useful options of read_excel (an example follows this list):
sheetname : string, int, mixed list of strings/ints, or None, default 0
Strings are used for sheet names, Integers are used in zero-indexed sheet positions.
Lists of strings/integers are used to request multiple sheets.
Specify None to get all sheets.
str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets.
Available Cases
Defaults to 0 -> 1st sheet as a DataFrame
1 -> 2nd sheet as a DataFrame
'Sheet1' -> 1st sheet as a DataFrame
[0, 1, 'Sheet5'] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
None -> All sheets as a dictionary of DataFrames
header : int, list of ints, default 0
Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex
skiprows : list-like
Rows to skip at the beginning (0-indexed)
index_col : int, list of ints, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex
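For instance, here is a sketch that reads the first two sheets of the worldbank file at once (assuming the file actually contains at least two sheets):
In [ ]:
# a list of sheet positions returns a dict of DataFrames keyed by position
sheets = pd.read_excel("../data/country_info_worldbank.xls",
                       sheetname=[0, 1],
                       skiprows=[0, 1, 2],
                       header=1)
sheets[0].head()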
The use of SQL databases as a data source is very useful; however, it won't be covered in depth in these notes.
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API.
Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database.
Here are some examples:
http://stackoverflow.com/questions/10065051/python-pandas-and-databases-like-mysql
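And a minimal self-contained sketch, using an in-memory SQLite database through SQLAlchemy so that no external server is needed:
In [ ]:
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # in-memory SQLite database
# write a toy table, then read it back as a DataFrame
pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}).to_sql("demo", engine, index=False)
pd.read_sql("SELECT * FROM demo", engine)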
In previous sessions we saw how to create DataFrames using dicts and lists. This is still a valid option for loading text files as a DataFrame: simply open the file as a regular text file and parse its lines.
To convert the text file to a DataFrame we will follow 5 steps:
We can do this with the common Python approach for reading a text file.
In [17]:
import re
import pandas as pd
fname = "../data/unstructured_data.txt"
with open(fname) as f:
    content = f.readlines()
In [2]:
content
Out[2]:
In [18]:
# keep only the lines that do not start with '#'
data = [l for l in content if re.match("^(?!#).+", l)]
header = data[0].strip("\n")
data = data[1:]
print([header] + data)
In [19]:
# header fields are '/'-separated (dropping the first field); data fields are ','-separated
data_str = [header.split("/")[1:]] + [l.split(",") for l in data]
data_str
Out[19]:
In [20]:
df = pd.DataFrame(data_str[1:], columns=pd.Series(data_str[0]))
df
Out[20]:
We can use to_numeric. This function accepts the argument errors:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
If 'raise', then invalid parsing will raise an exception
If 'coerce', then invalid parsing will be set as NaN
If 'ignore', then invalid parsing will return the input
In [6]:
pd.to_numeric(df["Age"],errors="ignore")
Out[6]:
Both DataFrames and Series have methods to apply functions over the elements in the structure. These operations take a function (built-in, user-defined, or lambda) and apply it to each element value in the DataFrame or Series, row- or column-wise.
For DataFrames we have:
In [5]:
import numpy as np
df2 = pd.DataFrame(np.random.randn(4, 3),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df2)
f = lambda x: x.max() - x.min()
print(df2.apply(lambda x: x.max() - x.min(),axis=0))
print(df2.apply(f,axis=1))
g = lambda x: x**2
print(df2.applymap(g))
In [6]:
np.random.randn(4, 3)
Out[6]:
In [14]:
def my_func(x):
    return x.mean()
df2.apply(my_func, axis=1)
Out[14]:
For Series we have:
More about applymap, apply and map: http://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas
In [9]:
s = pd.Series(np.arange(5))
print(s)
print(s.map(lambda x: x**2))
print(s.apply(lambda x: x**2))
(from http://pandas.pydata.org/pandas-docs/stable/text.html)
Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array.
These methods exclude missing/NA values automatically and follow (more or less) the same syntax as the built-in string object, under the str namespace. Let's use the DataFrame from the unstructured dataset:
In [21]:
df["Name"] = df["Name"].str.title()
print(df["Name"].str.len())
df
Out[21]:
In [22]:
df.columns
Out[22]:
In [23]:
df.columns.str.lower()
Out[23]:
The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:
In [24]:
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '], index=range(3))
df
Out[24]:
Since df.columns is an Index object, we can use the .str accessor:
In [25]:
print(df.columns.str.strip())
print(df.columns.str.lower())
These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing whitespace, lowercasing all names, and replacing any remaining whitespace with underscores:
In [26]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df
Out[26]:
Some things to note:
Notice that if a variable can't be coerced to any other type, Pandas will coerce it to str. It's important to know how to manipulate str values.
str operations can also be performed using apply, as shown below.
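For example, here is upper-casing with the str accessor and with apply; the isinstance guard is needed because apply, unlike the str methods, does not skip missing values:
In [ ]:
s_demo = pd.Series(["ann", "bob", np.nan])
print(s_demo.str.upper())  # NaN is skipped automatically
print(s_demo.apply(lambda x: x.upper() if isinstance(x, str) else x))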
In [27]:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')
Out[27]:
Elements in the split lists can be accessed using get or [] notation:
In [28]:
print(s2.str.split('_').str.get(1))
print(s2.str.split('_').str[1])
It is easy to expand this to return a DataFrame using the expand argument.
In [29]:
s2.str.split('_', expand=True)
Out[29]:
It is also possible to limit the number of splits:
In [30]:
s2.str.split('_', expand=True, n=1)
Out[30]:
rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:
In [31]:
s2.str.rsplit('_', expand=True, n=1)
Out[31]:
Methods like replace and findall take regular expressions, too:
In [32]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3
Out[32]:
In [33]:
s3.str.replace('^.a|dog', 'XX-XX ', case=False)
Out[33]:
We can directly access every character position in a text variable using str indexing with [] notation. If an index is out of bounds for a given string, NaN is returned.
In [34]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
print(s.str[0])
print(s.str[1])
In [35]:
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)', expand=False)
Out[35]:
Setting the expand argument to False returns a Series when possible. If we set it to True, a DataFrame is always returned. If the regular expression contains multiple groups, a DataFrame is returned in both cases.
In [36]:
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])', expand=False)
Out[36]:
In [37]:
pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])', expand=True)
Out[37]:
Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be “converted” into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. The dtype of the result is always object, even if no match is found and the result only contains NaN.
Named groups like
In [38]:
pd.Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)', expand=False)
Out[38]:
and optional groups like
In [39]:
pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)', expand=False)
Out[39]:
can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.
To know more about expand attribute (e.g. when used with Index) refer to: http://pandas.pydata.org/pandas-docs/stable/text.html#extracting-substrings
The extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.
In [ ]:
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)
In [ ]:
s.str.extractall(two_groups)
How do we access a MultiIndex?
You can think of a MultiIndex as a tuple-based index:
In [ ]:
mi_df = s.str.extractall(two_groups)
print(mi_df.columns)
print(mi_df.index)
print(mi_df.loc[("A",0)])
print(mi_df.T["A",0])
We can use the xs accessor with the level parameter to filter sub-index values:
In [ ]:
extract_result = s.str.extract(two_groups, expand=True)
extractall_result = s.str.extractall(two_groups)
print(extractall_result)
print(extractall_result.xs(0, level="match"))
print(extractall_result.xs(1, level="match"))
We can test whether elements contain a pattern:
In [ ]:
pattern = r'[a-z][0-9]'
pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
or match a pattern:
In [ ]:
pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern, as_indexer=True)
Methods like match, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:
In [ ]:
s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s4.str.contains('A', na=False)
Note that boolean values can be summed, so contains can be used to count matching elements:
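In [ ]:
# True counts as 1 and False as 0, so the sum is the number of matches
s4.str.contains('A', na=False).sum()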
String methods can also help align nearly matching indices. Consider these two DataFrames, where the second index contains typos ('too', 'fours'):
In [ ]:
df1 = pd.DataFrame([[1],[2],[3],[4],[5]],
index=['one','two','three','four','five'],
columns=['number'])
df1
In [ ]:
df2 = pd.DataFrame([['a'],['b'],['c'],['d'],['e']],
index=['one','too','three','fours','five'],
columns=['letter'])
df2
In [ ]:
import difflib
# map each label in df2's index to its closest match in df1's index
df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
df1.join(df2)
As already stated, when it is not possible to convert (coerce) values to a well-defined data type, Pandas falls back to str. We have already worked with strings, and we know how to apply functions element- and column-wise, so it won't be difficult to convert types ourselves.
Pandas provides some functions to convert types:
In [41]:
s = pd.Series(["10","0.9","56","87'6","34'89"])
In [44]:
# we can raise, ignore or coerce
pd.to_numeric(s,errors="coerce")
Out[44]:
In [51]:
pd.to_numeric(s.str.replace("'","."),errors="coerce")
Out[51]:
Then we can use str.replace to solve this problem. Be careful when applying str methods over a Series: it has to contain only str objects, otherwise NaN is returned.
In [ ]:
pd.to_numeric(s.str.replace("'","."))
In [53]:
df.columns
Out[53]:
If we recover the unstructured DataFrame, we can now convert it to the correct data types:
In [52]:
df["Name"] = df["Name"].str.title().str.strip()
df["Age"] = pd.to_numeric(df["Age"].str.replace("\"",""))
df
In [ ]:
#this does not work
#pd.to_timedelta(df["Time"].str.strip().map(lambda x: "00:"+x),errors="coerce")
#workaround
def str_convert(x):
    return pd.NaT if len(x) == 0 else "00:" + x
pd.to_timedelta(df["Time"].str.strip().map(str_convert),box=True)
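pd.to_datetime works analogously for dates; here is a minimal sketch with made-up strings:
In [ ]:
# "not a date" cannot be parsed and becomes NaT thanks to errors="coerce"
pd.to_datetime(pd.Series(["2016-01-31", "2016-02-29", "not a date"]),
               errors="coerce")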
Exercise: Read iqsize.csv using read_csv. Use na_values to assign missing data.
Exercise: Read country_info_worldbank.xls using read_excel. Use skiprows to get rid of empty lines.
Exercise: Try to load papers.lst reading it line by line taking into account that the fields are the following ones:
Exercise: Repeat the previous exercise taking advantage of the str functions.
Exercise: Load books.csv. Extract the author, or 'Unknown' when missing. Then work with the title: erase irrelevant data and try to get as many authors and dates as possible.
Exercise: Convert to Technically Correct Data: iqsize.csv.
Exercise: Convert to Technically Correct Data: country_info_worldbank.xls.